Skip to content

Conversation

@jfedorov
Copy link
Contributor

@jfedorov jfedorov commented Sep 22, 2025

Description

This is an initial implementation of PTI Synchronous Callback API.
Current implementation serves to implement on top of it PTIMetricsScope API (will be added soon) - to collect hardware metrics per individual GPU operations via Event Query mechanism.

This PR

  • implement for 2 domains only, for immediate command list only
  • make callback sample
  • add sample to tests and run asan and tsan on it

Area of the change

  • PTI SDK
  • Unitrace
  • Other Tool(s) - e.g. Sysmon, or any code above SDK directory (except Unitrace)
  • Infrastructure - e.g. GitHub workflows, and other whole repo impacted

Type(s) of change

Choose one or multiple, leave empty if none of the other choices apply

  • Bug fix - change that fixes an issue
  • New feature - change that adds functionality
  • Performance improvement - change that lowers profiling overhead
  • Tests - change in tests
  • Samples - change in samples
  • Documentation - documentation update

Tests

  • Added - required for new features and some bug fixes
    Tests will be added in the next PR
  • Not needed

Specific HW and OS where to run the test unless generic:

For example, 2 discrete GPUs, integrated GPU, specific GPU model, PyTorch integration test(s)

Checklist

  • Have all tests, except Quarantined, passed locally?
  • Do all newly added source files have a license header?

Details on API(s) or command line option(s) changes

  • API(s) or command line options not changed
  • New API or command line options added
  • Existing API(s) or command line options changed - so backward compatibility broken
  • Unknown

If applies - details on the broken backward compatibility

Indicate what API(s) backward compatibility or option(s) is broken, why it might be OK, or suggest on how to deal with it moving forward

Notify the following users

@jmellorcrummey , @Thyre , @anmyachev, @yuninxia @mschilling0, @Rogersyp

Other information

- implement for 2 domians only, append to immediate cmd list only
- make callback sample
- add sample to tests and run asan and tsan on it

Signed-off-by: jfedorov <[email protected]>
@jfedorov jfedorov requested a review from Copilot September 22, 2025 10:26
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements an initial version of the PTI Synchronous Callback API, designed to collect hardware metrics per individual GPU operations via event query mechanism. The implementation covers two domains for immediate command lists and includes a sample demonstrating the functionality.

  • Adds a complete callback subscription system with subscriber management
  • Implements callback support for GPU operation appended and completed domains
  • Creates a callback sample demonstrating matrix multiplication with callback hooks

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/include/pti/pti_sync_callback.h New public API header defining callback domains, phases, and function types
sdk/src/levelzero/ze_collector_cb_helpers.h New helper classes for managing callback subscribers and domain properties
sdk/src/levelzero/ze_collector.h Extensive updates adding callback subscriber management and callback invocation logic
sdk/src/pti_view.cc Implementation of callback API functions with exception handling
sdk/src/pti_view_load.cc API function wrappers for callback functionality
sdk/src/view_handler.h View handler methods for callback operations
sdk/src/pti_lib_handler.h Function pointer declarations for callback APIs
sdk/src/pti.cc String conversion utilities for callback enums
sdk/src/unikernel.h Addition of KernelCommandType enum and command_type_ field
sdk/samples/callback/main.cc Complete sample application demonstrating callback API usage
sdk/samples/callback/CMakeLists.txt Build configuration for callback sample
sdk/samples/samples_utilities/samples_utils.h Utility functions for callback data dumping and command list type conversion
sdk/src/gen_tracing_callbacks.py Code generator updates to support callback functionality
sdk/test/CMakeLists.txt Test configuration updates to include callback sample
sdk/CMakeLists.txt Build system updates to include callback sample

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

// - All times reported by PTI_VIEW in CPU (aka Host) timescale
// - However GPU "commands" (kernel & memory transfers) start/end reported in GPU timescale
// - There is significant time drift between CPU and GPU, so to cope wth it - need to
// - There is significant time drift between CPU and GPU, so to cope with it - need to
Copy link

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed typo 'wth' to 'with' in the comment.

Copilot uses AI. Check for mistakes.

void PostAppendKernelCommandCommon(ZeCollector* /*collector*/, ZeKernelCommand* command,
ZeKernelCommandProps& props, ze_event_handle_t& signal_event,
ze_event_handle_t& signal_event,
Copy link

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function signature changed to remove the ZeKernelCommandProps& props parameter, but the function is still being called with the old signature in some places. Ensure all call sites are updated to match the new signature.

Suggested change
ze_event_handle_t& signal_event,

Copilot uses AI. Check for mistakes.
Comment on lines +270 to +274
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle));
if (res) {
std::cout << "Prepared data for Append" << std::endl;
res = zeCommandListAppendWriteGlobalTimestamp(
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle),
Copy link

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is casting _queue_handle to ze_command_list_handle_t, but based on the API design, _queue_handle should be a queue handle, not a command list handle. This type mismatch could lead to runtime errors.

Suggested change
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle));
if (res) {
std::cout << "Prepared data for Append" << std::endl;
res = zeCommandListAppendWriteGlobalTimestamp(
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle),
callback_data->_command_list_handle);
if (res) {
std::cout << "Prepared data for Append" << std::endl;
res = zeCommandListAppendWriteGlobalTimestamp(
callback_data->_command_list_handle,

Copilot uses AI. Check for mistakes.
Comment on lines +270 to +274
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle));
if (res) {
std::cout << "Prepared data for Append" << std::endl;
res = zeCommandListAppendWriteGlobalTimestamp(
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle),
Copy link

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the previous issue, _queue_handle is being incorrectly cast to ze_command_list_handle_t when it should be treated as a queue handle. This could cause the Level Zero API call to fail or behave unexpectedly.

Suggested change
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle));
if (res) {
std::cout << "Prepared data for Append" << std::endl;
res = zeCommandListAppendWriteGlobalTimestamp(
static_cast<ze_command_list_handle_t>(callback_data->_queue_handle),
callback_data->_command_list_handle);
if (res) {
std::cout << "Prepared data for Append" << std::endl;
res = zeCommandListAppendWriteGlobalTimestamp(
callback_data->_command_list_handle,

Copilot uses AI. Check for mistakes.
@jfedorov jfedorov mentioned this pull request Sep 22, 2025
Copy link

@Thyre Thyre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for giving the option to provide feedback on this interface.
I think that this generally goes in the right direction. I have a few comments and questions regarding a few things.

This is obviously only the first step towards having a callback interface, as one can see with only having a limited amount of domains, operation kinds and so on.

I think some parts can be made a bit more clear with more documentation, e.g. when one would expect to see cb_data to actually be set. At the moment, a tool developer would just need to guess when this is the case, and hope that he casts to the right struct.
I'd guess that one needs to cast to pti_callback_gpu_op_data for PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_* and to pti_internal_callback_data for
PTI_CB_PHASE_INTERNAL_*?


Please also note that I only looked at the headers here. I'm not familiar with the internals of the interface, and would need quite a bit more time to get familiar with that for a proper review.

//!< provided TimestampCallback
PTI_ERROR_BAD_API_ID = 7, //!< invalid api_id when enable/disable runtime/driver specific api_id
PTI_ERROR_BAD_API_ID = 7, //!< invalid api_id when enable/disable runtime/driver specific api_id
PTI_ERROR_AT_LEAST_ONE_GPU_VIEW_MUST_BE_ENABLED = 8, //!< at least one GPU view must be enabled for kernel tracing
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PTI_ERROR_AT_LEAST_ONE_GPU_VIEW_MUST_BE_ENABLED = 8, //!< at least one GPU view must be enabled for kernel tracing
PTI_NO_GPU_VIEW_ENABLED = 8, //!< at least one GPU view must be enabled for kernel tracing

I think having a shorter name for the error is actually helpful here.
If developers are interested in having a string description of their error code, one could think about implementing a function like const char* ptiGetErrorMessage(...), similar to CUPTI: https://docs.nvidia.com/cupti/api/group__CUPTI__RESULT__API.html

@@ -0,0 +1,227 @@
//==============================================================
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about the file name. pti_sync_callback.h sounds to me like we're only dealing with synchronization stuff here, although this looks to be all callback related event handling.
Can we find a better name for this? Maybe even just drop the _sync part?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was a suggestion from @jmellorcrummey - to name it in a way to pronounce that these callbacks are synchronous. may be I misunderstood
if you and @jmellorcrummey (and anyone) can suggest better name - this would be great

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, but I think using the abbreviation sync might be a bit unfortunate. I'm also not the best person for naming things though, so I don't have a better idea 😄
The naming of this file should certainly not be a blocker. Even having a short description in the file that this contains synchronous callbacks is sufficient to clearly communicate what this is about.

Comment on lines +14 to +15
* This file contains APIs that so far experimental in PTI
* APIs and data structures in this file are work-in-progress and subject to change!
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* This file contains APIs that so far experimental in PTI
* APIs and data structures in this file are work-in-progress and subject to change!
* This file contains APIs that are so far experimental in PTI
* APIs and data structures in this file are work-in-progress and subject to change!

Comment on lines +44 to +46
PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_APPENDED = 4, //!< This also serves as PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_DISPATCHED
//!< when appended to Immediate Command List,
//!< which means no separate callback PTI_CB_DOMAIN_DRIVER_GPU_OPERATION_DISPATCHED
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding:
The comment basically states that attaching an event to an immediate command list only dispatches this event. We still get an event when the operation has completed, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. there still be event on the completion.
But append and dispatch "merged" into one for Immediate command list

Comment on lines +53 to +55
PTI_CB_DOMAIN_DRIVER_API = 1023, //!< Not implemeted yet,
//!< attempt to enable it will return PTI_ERROR_NOT_IMPLEMENTED
//!< Callback created for all Driver APIs
Copy link

@Thyre Thyre Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you intend to have something like a RUNTIME_API as well or would this be handled in any other way? Would that even make sense with Level Zero, since runtime is most likely something like SYCL, OpenCL or other programming paradigms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRIVER_API here exactly refers to Level Zero. OpenCL would be also "driver level" (when/if supported)

May be one day we might want to add Callbacks to "user"-side /runtime APIs, that many users directly us - e.g. think oneCCL.

pti_backend_queue_t _queue_handle; //!< Device back-end queue handle
pti_device_handle_t _device_handle; //!< Device handle,
pti_callback_phase _phase; //!< Could be ONLY_NOTIFY or API Call ENTER/EXIT
uint32_t _return_code; // will be valid only for L0 API EXIT, for others will be nullptr
Copy link

@Thyre Thyre Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullptr wouldn't really be valid for uint32_t, should probably also be a defined value as suggested above.

Comment on lines +136 to +138
uint32_t _detail; // depending on the domain should be casted/interpreted
// as a purpose of an internal PTI thread or
// pti_internal_event_type
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t _detail; // depending on the domain should be casted/interpreted
// as a purpose of an internal PTI thread or
// pti_internal_event_type
uint32_t _detail; // For THREAD START/END, this describes the internal PTI thread purpose.
// For INTERNAL EVENT, this describes the pti_internal_event_type.

Is there any way to get the internal PTI thread purpose?
Having a number without any actual meaning feels a bit pointless, even if we have the message in the same struct.

Comment on lines +144 to +145
pti_api_group_id driver_api_group_id, // driver API group ID, keep it to distinguish between L0 and OpenCL
// although the current implementation is only for L0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always the driver_api ID, or may we also get other event types here?
PTI_API_GROUP_SYCL would be a runtime group_id, just by looking at the comments.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest just dropping the driver_ part from this argument and the next one. This gives a bit more flexibility in the future without breaking code building on top of this API.

Comment on lines +170 to +173
pti_result PTI_EXPORT
ptiCallbackSubscribe(pti_callback_subscriber_handle* subscriber,
pti_callback_function callback,
void* user_data);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a question out of curiosity. Do you plan to support more than one subscriber at a time?
CUPTI certainly won't allow more than one, I don't know about the AMD APIs. The OpenMP Tools Interface also only allows one, but one can multiplex this interface, as demonstrated by LLVM.

Having the option to attach multiple subscribers can have its benefit, with the main drawback obviously being the potential overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Current design/implementation supports multiple subscribers. Right - having multiple ones would bring additional overhead,. plus we also need explain how they should work together.

Comment on lines +187 to +188
* @param enter_cb - indicate if callback called on enter/start: 0-no, 1-yes; used only for domains with 2 sites
* @param exit_cb - indicates if callback called on exit/end: 0-no, 1-yes; used only for domains with 2 sites
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One should probably document which domains support these two parameters.
How should one call this function with these domains if he want to enable this domain?

What would happen if one passes false for both enter_cb and exit_cb?
Would the domain be disabled? Should this return an error code?

@jfedorov
Copy link
Contributor Author

@Thyre , thank you for your comments.
I might not be able to answer /address them all today/tomorrow. But as I will be back - I will go overall of these and others,
May be in a short term @mschilling0 and @Rogersyp can comment.

@Thyre
Copy link

Thyre commented Sep 22, 2025

@Thyre , thank you for your comments. I might not be able to answer /address them all today/tomorrow. But as I will be back -

There's no need to rush things 😄 ( at least not from my side )
Thanks again for accepting the feedback on this at all. I think this helps everyone at the end.

@mschilling0
Copy link
Contributor

@Thyre , thank you for your comments. I might not be able to answer /address them all today/tomorrow. But as I will be back - I will go overall of these and others, May be in a short term @mschilling0 and @Rogersyp can comment.

I left some comments internally, I'll take some of the API comments and put them here.


#ifndef PTI_SYNC_CALLBACK_H_
#define PTI_SYNC_CALLBACK_H_

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <stdint.h>

So we don't have implicit dependencies on other header files.

#ifndef PTI_SYNC_CALLBACK_H_
#define PTI_SYNC_CALLBACK_H_

#include "pti/pti_metrics.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include metrics in callbacks if metrics is going to be a user of the callbacks?

#include "pti/pti_view.h"

/**
* This file contains APIs that so far experimental in PTI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* This file contains APIs that so far experimental in PTI
* This file contains APIs that so far experimental in PTI.

* This file contains APIs that so far experimental in PTI
* APIs and data structures in this file are work-in-progress and subject to change!
*
* All in this file concerns Callback API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* All in this file concerns Callback API
* All content in this file concerns Callback API.

* APIs and data structures in this file are work-in-progress and subject to change!
*
* All in this file concerns Callback API
* Callback API is useful for many things,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Callback API is useful for many things,
* The Callback API is useful for many purposes,

*
* All in this file concerns Callback API
* Callback API is useful for many things,
* including to the implementation of MetricsScope functionality that wants to subscribe for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* including to the implementation of MetricsScope functionality that wants to subscribe for
* including to the implementation of `MetricsScope` functionality that needs to subscribe to

* All in this file concerns Callback API
* Callback API is useful for many things,
* including to the implementation of MetricsScope functionality that wants to subscribe for
* kernel append to command list .. and may be to other events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* kernel append to command list .. and may be to other events.
* events like kernel appends to a command list, and potentially other events.

// although the current implementation is only for L0
uint32_t driver_api_id,
pti_backend_ctx_t backend_context, //!< Driver (L0) level context handle
void* cb_data, //!< depending on the domain it should be type-casted to the pointer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both pti_callback_gpu_op_data and pti_internal_callback_data contain _domain and _phase can we seperate those into a base record and make this,

pti_callback_base {
.. _domain
.. _phase.
}

pti_callback_base* cb_data

then cast based on the value of cb_data->domain?

struct _pti_callback_gpu_op_data {
pti_callback_base*
}

then I think you can remove domain from the callback too but not sure about that

// or API related to GPU operation submission.
} pti_callback_gpu_op_data;

typedef struct _pti_internal_callback_data {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsistently named with _pti_callback_gpu_op_data.

Should it be _pti_gpu_op_callback_data ? or vice versa

void* user_data);

/**
* @brief Unsubscribe Callback subscriber, this unsubscribes from all domains, disables callback,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Unsubscribe Callback subscriber, this unsubscribes from all domains, disables callback,
* @brief Unsubscribe Callback subscriber. This unsubscribes from all domains, disables the callback,


/**
* @brief Unsubscribe Callback subscriber, this unsubscribes from all domains, disables callback,
* clean all resources related to the subscriber handle and invalidate the handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* clean all resources related to the subscriber handle and invalidate the handle
* cleans up all resources related to the subscriber handle, and invalidate the handle.

*
* @param subscriber - subscriber handle
* @param domain - domain to enable
* @param enter_cb - indicate if callback called on enter/start: 0-no, 1-yes; used only for domains with 2 sites
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no/yes should probably be an enum. The API is ambiguous this way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants